Replace special characters by non-special characters

Pikkel

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Jul 17 '05 #1


Michael Fesser

.oO(Pikkel)

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Maybe strtr()?

Micha

Jul 17 '05 #2

CJ Llewellyn

"Pikkel" <pi****@de.wo p> wrote in message
news:41******** *************** @news.xs4all.nl ...

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

http://uk.php.net/manual/en/function...ecialchars.php

Jul 17 '05 #3

Pikkel

CJ Llewellyn wrote:

"Pikkel" <pi****@de.wo p> wrote in message
news:41******** *************** @news.xs4all.nl ...
I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

http://uk.php.net/manual/en/function...ecialchars.php

Thanks for your tip, but I'm not looking for HTML replacement but
character replacement: á --> a

Jul 17 '05 #4

Pikkel

Michael Fesser wrote:

.oO(Pikkel)

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

Maybe strtr()?

Micha

With this function I would have to list all the replacements myself.
I was looking for a complete [accent, cedilla, umlaut, etc.] strip function.

Jul 17 '05 #5

Andy Hassall

On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pi****@de.wo p> wrote:

I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

In what character set encoding? If it's a small one, e.g. iso-8859-15, just
list all the accented/non-accented pairs and run it through strtr.

If it's a Unicode variant, it's a bit more of a challenge...
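A rough sketch of the strtr() approach for a single-byte encoding, assuming the input is ISO-8859-15 (the bytes used here are the same in ISO-8859-1). The function name strip_accents_latin and the choice of pairs are made up for illustration, and the list is deliberately incomplete:

```php
<?php
// Hypothetical helper, not a built-in. Assumes $text is ISO-8859-1/-15.
function strip_accents_latin(string $text): string
{
    // Plain letter => the accented bytes that should collapse to it.
    // Hex escapes are used so the sketch does not depend on the encoding
    // this source file happens to be saved in.
    $pairs = [
        'A' => "\xC0\xC1\xC2\xC3\xC4\xC5",
        'C' => "\xC7",
        'E' => "\xC8\xC9\xCA\xCB",
        'I' => "\xCC\xCD\xCE\xCF",
        'N' => "\xD1",
        'O' => "\xD2\xD3\xD4\xD5\xD6",
        'U' => "\xD9\xDA\xDB\xDC",
        'Y' => "\xDD",
        'a' => "\xE0\xE1\xE2\xE3\xE4\xE5",
        'c' => "\xE7",
        'e' => "\xE8\xE9\xEA\xEB",
        'i' => "\xEC\xED\xEE\xEF",
        'n' => "\xF1",
        'o' => "\xF2\xF3\xF4\xF5\xF6",
        'u' => "\xF9\xFA\xFB\xFC",
        'y' => "\xFD\xFF",
    ];

    // Build two equal-length strings for the three-argument form of strtr().
    $from = '';
    $to = '';
    foreach ($pairs as $plain => $accented) {
        $from .= $accented;
        $to .= str_repeat($plain, strlen($accented));
    }

    return strtr($text, $from, $to);
}
```

The three-argument form of strtr() maps byte for byte, which is why it only suits single-byte encodings.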

--
Andy Hassall / <an**@andyh.co.uk> / <http://www.andyh.co.uk>
<http://www.andyhsoftware.co.uk/space> Space: disk usage analysis tool

Jul 17 '05 #6

lawrence

Andy Hassall <an**@andyh.co. uk> wrote in message news:<tc******* *************** **********@4ax. com>...

On Fri, 05 Nov 2004 22:08:03 +0100, Pikkel <pi****@de.wo p> wrote:
I'm looking for a way to replace special characters with characters
without accents, cedillas, etc.

In what character set encoding? If it's a small one, e.g. iso-8859-15, just
list all the accented/non-accented pairs and run it through strtr.

If it's a Unicode variant, it's a bit more of a challenge...

I'm possibly beating this subject to death, but I've yet to think of
an answer to the problem. If a user copies text from an ISO-8859-15
page, pastes it into the textarea of a form, and then submits
it to a CMS which then sends it out as UTF-8, one gets garbage
characters, as one can see on this page:

http://www.krubner.com/index.php?pageId=33396

So I'm wondering if there is a way to cycle through and find quote
marks and such that are unique to ISO-8859-15?

Jul 17 '05 #7

Andy Hassall

On 6 Nov 2004 01:19:52 -0800, lk******@geocit ies.com (lawrence) wrote:

I'm possibly beating this subject to death, but I've yet to think of
an answer to the problem. If a user copies text from an ISO-8859-15
page, pastes it into the textarea of a form, and then submits
it to a CMS which then sends it out as UTF-8, one gets garbage
characters, as one can see on this page:

http://www.krubner.com/index.php?pageId=33396

There's probably a bit more to it than that, such as the encoding of the page
containing the form in the first place. If you just dump out ISO-8859-15
encoded data and pretend it's UTF-8, of course it won't work, except for the
shared ASCII (top bit not set, i.e. <= 127) representations between the two
encodings. I can't remember quite where you got to from the previous threads on
this subject though.

So I'm wondering if there is a way to cycle through and find quote
marks and such that are unique to ISO-8859-15?

If it's between ISO-8859-15 and UTF-8, there are no characters unique to
ISO-8859-15, since UTF-8 encodes all those characters and more. The two
encodings differ for every character above 127, but that's a different
question. The Euro is the same character in both, but is encoded differently
in each.
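To make the Euro point concrete, here is a small sketch that prints the bytes of the same character in both encodings. It assumes the mbstring extension is available; the variable names are just for illustration:

```php
<?php
// The Euro sign in ISO-8859-15 is the single byte 0xA4.
$euroLatin9 = "\xA4";

// Convert that one character to UTF-8 (requires the mbstring extension).
$euroUtf8 = mb_convert_encoding($euroLatin9, 'UTF-8', 'ISO-8859-15');

// Same character, different bytes: one byte versus three.
echo bin2hex($euroLatin9), "\n"; // a4
echo bin2hex($euroUtf8), "\n";   // e282ac
```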

But anyway, it seems to me that the simple approach is just:

(1) Present the form in UTF-8 in the first place.
(2) The user copies content from one site, in whatever encoding. Their browser
places it on the clipboard in some OS-native encoding which is hopefully
irrelevant.
(3) The user pastes it into the UTF-8 form. The browser converts the characters
into the appropriate encoding.
(4) Post the data; since the source form is UTF-8, the data is sent in UTF-8,
and you're done.
(5) You can then just reject anything that comes in as malformed UTF-8 from the
previous step.
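Step (5) above might look something like this. It is only a sketch: accept_utf8 is a made-up name, and it leans on mb_check_encoding() from the mbstring extension rather than the hand-written checks from the earlier threads:

```php
<?php
// Hypothetical helper: returns the text unchanged if it is well-formed
// UTF-8, or null if it should be rejected.
function accept_utf8(string $raw): ?string
{
    return mb_check_encoding($raw, 'UTF-8') ? $raw : null;
}

// Possible use in the form handler:
// $text = accept_utf8($_POST['text'] ?? '');
// if ($text === null) { /* reject the submission */ }
```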

Consider:

Two scripts, one to output iso-8859-15 and the other Codepage 1252 (with the
dread Smart Quotes and all):

<?php
// First script: prints the printable ISO-8859-15 characters.
header('Content-Type: text/html; charset=ISO-8859-15');
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
$n = 0;
for ($i = 32; $i < 255; $i++)
{
    // Skip DEL and the 128-159 range, which is non-printable in ISO-8859-15.
    if ($i >= 127 && $i <= 159)
        continue;

    print htmlspecialchars(chr($i), ENT_COMPAT, 'ISO-8859-15');
    // Start a new line after every 16 characters.
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>

<?php
// Second script: prints the Codepage 1252 characters, including 128-159.
header('Content-Type: text/html; charset=Windows-1252');
?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Characters to copy</title>
</head>
<body>
<pre>
<?php
$n = 0;
for ($i = 32; $i < 255; $i++)
{
    print htmlspecialchars(chr($i), ENT_COMPAT, 'cp1252');
    // Start a new line after every 16 characters.
    if ($n++ % 16 == 15) print "\n";
}
?>
</pre>
</body>
</html>

Then utf8form.php: put text in, and it is printed back out encoded as UTF-8:

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Outputting</title>
</head>
<body>
<pre>
<?php
// Echo the submitted text back, escaped for HTML and treated as UTF-8.
if (isset($_POST['text']))
{
    print htmlspecialchars($_POST['text'], ENT_COMPAT, 'UTF-8');
}
?>
</pre>

<form method="post" action="utf8form.php" accept-charset="utf-8">
<textarea name="text"></textarea>
<input type="submit">
</form>

</body>
</html>

In Firefox and IE6, this appears to work for me: I copied all of the output
from the first pages, which was ISO-8859-15 or Codepage 1252, pasted it into
the second page, and submitted the form. The output is the same set of
characters, but UTF-8 encoded.

Also worked from other character set encodings; found a page encoded in
Shift-JIS and repeated the steps. The output looked the same to me (although I
can't read Japanese).

OK - that's the purist approach, when all the tools in the chain are
apparently handling encodings properly.

But are you after some more pragmatic approach, something like:

"The data my users send is probably iso8859-1, iso8859-15, codepage 1252, or
maybe utf-8, but it's likely been copied and mangled between applications so I
can't reliably tell which. How do I clean this data up in a reasonable way so
it can be converted to UTF8 for presentation on a UTF8 encoded page?"

If all the data has values <=127 then it's easy - that's all plain ASCII which
is a common subset of all four character sets.

You can at least rule out UTF-8 by using the functions posted in previous
threads looking for malformed UTF-8. If there's a significant number of
characters >127 and it all validates as UTF-8, then the more characters above
127 there are, the better the odds that it really is UTF-8, but it's still
not certain.

If it doesn't validate, you've narrowed it down to one of the three
single-byte character sets.

Then the major differences are:

Codepage 1252 has printable characters in the range 128-159 (with a couple of
gaps) whereas the iso8859 encodings only have non-printable characters there. So
if there's data in this range, odds are it's Codepage 1252 - so you can convert
it to UTF-8 from there.

This range holds the angled "smart" quotes, and the em-dash, which are the
characters that cause the most trouble. So alternatively, you could convert
them to plain quotes and dashes if you wanted.
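The "convert them to plain quotes and dashes" option could be sketched like this, assuming the text has already been identified as Codepage 1252. The function name plain_quotes_cp1252 is invented for the example:

```php
<?php
// Hypothetical helper. Assumes $text is Codepage 1252 encoded.
function plain_quotes_cp1252(string $text): string
{
    // Byte values are the Codepage 1252 positions of each character.
    return strtr($text, [
        "\x91" => "'", // left single quote
        "\x92" => "'", // right single quote
        "\x93" => '"', // left double quote
        "\x94" => '"', // right double quote
        "\x96" => '-', // en dash
        "\x97" => '-', // em dash
    ]);
}
```

After this step the quotes and dashes are plain ASCII, so they survive whichever of the three single-byte encodings the rest of the text turns out to be.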

If there are no characters in that range, then you haven't ruled out 1252, but
the rest of the encoding is pretty similar between 1252, iso8859-1 and
iso8859-15.

See http://en.wikipedia.org/wiki/ISO_8859-15 for the differences between -1
and -15; the character most worth worrying about is the Euro (which is
somewhere else again in 1252 - in the 128-159 range, I believe).
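Putting the heuristic above together, a sketch might look like this. It is a guess, not a reliable detector: guess_encoding and to_utf8 are made-up names, mbstring is assumed, and falling back to ISO-8859-15 rather than ISO-8859-1 is an arbitrary choice on my part:

```php
<?php
// Hypothetical helper: returns a best guess at the encoding of $bytes.
function guess_encoding(string $bytes): string
{
    // Nothing above 127: plain ASCII, common to all four encodings.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return 'ASCII';
    }
    // High bytes that form well-formed UTF-8 are probably UTF-8.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';
    }
    // Bytes in 128-159 are printable only in Codepage 1252.
    if (preg_match('/[\x80-\x9F]/', $bytes)) {
        return 'Windows-1252';
    }
    // Cannot tell -1, -15 and 1252 apart here; this pick is an assumption.
    return 'ISO-8859-15';
}

// Hypothetical helper: converts $bytes to UTF-8 based on the guess above.
function to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', guess_encoding($bytes));
}
```

Short strings with only one or two high bytes are where this is most likely to guess wrong.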

Is this any help?

Jul 17 '05 #8

Pikkel

Andy Hassall wrote:

Is this any help?

It's useful information and I'll remember this. Thank you.
But it's not the answer to my question of whether there is a function which
converts characters with accents, umlauts and so on, to characters without.

Jul 17 '05 #9

Andy Hassall

On Sat, 06 Nov 2004 22:54:00 +0100, Pikkel <pi****@de.wo p> wrote:

It's useful information and I'll remember this. Thank you.
But it's not the answer to my question of whether there is a function which
converts characters with accents, umlauts and so on, to characters without.

True, it's drifted a bit to answer lawrence's questions.

As far as your question goes - no, there isn't a built-in function; you'd have
to write one. In order to do so, you have to be a lot more specific about the
character encodings you're using, which characters you want to convert to what,
and exactly what "and so on" means in your last sentence.
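For what it's worth, here is one shape such a hand-written function could take, assuming the input is UTF-8. The name strip_accents_utf8 and the (very incomplete) list of replacements are made up for illustration; which characters map to what is exactly the decision you would have to make yourself:

```php
<?php
// Hypothetical helper, not a built-in. Assumes $text is UTF-8.
function strip_accents_utf8(string $text): string
{
    // "\u{...}" escapes (PHP 7+) yield UTF-8 bytes regardless of how this
    // source file is saved. The array form of strtr() replaces whole
    // multi-byte sequences, so it is safe for UTF-8 input.
    $map = [
        "\u{00E0}" => 'a',  // a with grave
        "\u{00E1}" => 'a',  // a with acute
        "\u{00E2}" => 'a',  // a with circumflex
        "\u{00E4}" => 'a',  // a with diaeresis
        "\u{00E7}" => 'c',  // c with cedilla
        "\u{00E8}" => 'e',  // e with grave
        "\u{00E9}" => 'e',  // e with acute
        "\u{00EA}" => 'e',  // e with circumflex
        "\u{00EB}" => 'e',  // e with diaeresis
        "\u{00EF}" => 'i',  // i with diaeresis
        "\u{00F1}" => 'n',  // n with tilde
        "\u{00F6}" => 'o',  // o with diaeresis
        "\u{00FC}" => 'u',  // u with diaeresis
        "\u{00DF}" => 'ss', // sharp s: one character becomes two
        "\u{00E6}" => 'ae', // ae ligature
        "\u{00F8}" => 'o',  // o with stroke
    ];

    return strtr($text, $map);
}
```

Characters not in the map pass through untouched, so the list has to be extended for every language you care about.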

Jul 17 '05 #10
